R Exercise 4. Clustering and classification

I have loaded the Boston dataset, which contains information on housing values in the suburbs of Boston. The data have 506 observations and 14 variables, such as the per-capita crime rate of the town, the average number of rooms per dwelling, the pupil-teacher ratio and the proportion of lower-status population. Together these variables help to evaluate the value of houses in the area.
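
A minimal sketch of how the data can be loaded and inspected. It assumes the copy of Boston that ships with the MASS package; the exact calls used for this report are not shown.

```r
library(MASS)   # the Boston data ships with the MASS package

data("Boston")
str(Boston)     # structure: variable names, types and first values
dim(Boston)     # number of rows and columns
```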

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506  14

The graphical representation of the variables shows that some pairs of variables are strongly correlated; in addition, for several variables the observations accumulate close to the edge of their range.
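
The scatterplot matrix and the numerical summary could be produced along these lines. The plotting options used for this report are not shown, so the plain defaults here are an assumption.

```r
library(MASS)
data("Boston")

# Scatterplot matrix of all variables. Note the parentheses:
# typing a bare `pairs` only prints the function definition.
pairs(Boston)

# Five-number summary and mean for every variable
summary(Boston)
```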

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

However, I will also plot the correlation matrix in order to explore the data in more detail.
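
A minimal sketch of the correlation matrix using base R only. The name `cor_matrix` and the 0.7 threshold for listing strong pairs are my own choices for illustration; the plotting function used for the correlogram is not shown in this report and is therefore left out.

```r
library(MASS)
data("Boston")

# Pairwise Pearson correlations, rounded to two decimals
cor_matrix <- round(cor(Boston), digits = 2)
cor_matrix

# List the strongest pairs (|r| >= 0.7), upper triangle only
strong <- which(abs(cor_matrix) >= 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, "row"]],
           var2 = colnames(cor_matrix)[strong[, "col"]],
           r    = cor_matrix[strong])
```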

##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

The correlogram gives a more comprehensive picture of the correlations between the variables. There are strong negative correlations between indus and dis (proportion of non-retail business acres per town and weighted mean of distances to five Boston employment centres, -0.71), nox and dis (nitrogen oxides concentration, -0.77), age and dis (proportion of owner-occupied units built prior to 1940, -0.75) and lstat and medv (lower status of the population and median value of owner-occupied homes, -0.74). Strong positive correlations appear between indus and nox (0.76), rad and tax (index of accessibility to radial highways and full-value property-tax rate per $10,000, 0.91) and nox and age (0.73). On closer inspection, these correlations seem logical.

However, in order to classify and cluster the observations, the variables have to be standardized so that they are comparable.
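
A minimal sketch of the standardization, assuming base R's `scale()` with its defaults (subtract the column mean, divide by the column standard deviation). The object name `boston_scaled` is an assumption for illustration.

```r
library(MASS)
data("Boston")

# (x - mean(x)) / sd(x), applied column by column
boston_scaled <- scale(Boston)
summary(boston_scaled)
class(boston_scaled)   # scale() returns a matrix, not a data frame

# Convert back to a data frame for the modelling steps
boston_scaled <- as.data.frame(boston_scaled)
```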

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
## [1] "matrix"
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

After standardization every variable has mean zero and unit standard deviation, so all variables are on a common scale. The standardized data will be used in the further analysis.

I will also create a categorical variable crime from the continuous variable crim, using the quantiles of crim as break points. I then remove crim from the dataset so that it does not affect the further analysis.
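
One way to build such a variable is sketched here, assuming the quartiles of the scaled crim are used as break points and the four labels shown in the table. The object names are assumptions.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Quartiles of the scaled crime rate serve as break points
bins <- quantile(boston_scaled$crim)

# include.lowest = TRUE keeps the minimum value inside the first bin
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# Replace the continuous variable with the categorical one
boston_scaled$crim  <- NULL
boston_scaled$crime <- crime
```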

## crime
##      low  med_low med_high     high 
##      127      126      126      127

After these transformations, I divide the data into a train set (80% of the data) and a test set (20% of the data) in order to proceed with the linear discriminant analysis (LDA).
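
A sketch of a random 80/20 split. The seed is arbitrary and only there for reproducibility; the split behind the results in this report was drawn differently, so the numbers will not match exactly.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)                              # arbitrary seed (assumption)
n     <- nrow(boston_scaled)
ind   <- sample(n, size = floor(n * 0.8))  # row indices of the train set
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

c(train = nrow(train), test = nrow(test))
```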

Now I fit the linear discriminant analysis on the train set. I use the categorical crime rate as the target variable and all the other variables in the dataset as predictors. LDA finds the linear combinations of the explanatory variables that best separate the classes of the crime variable.
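
A self-contained sketch of the fit. The model call `lda(crime ~ ., data = train)` matches the one printed in the output; the preparation steps and the seed are assumptions, so the fitted numbers will differ from those shown.

```r
library(MASS)
data("Boston")

# Setup (assumed): scaled data, quartile-based crime classes, 80/20 split
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)                                   # arbitrary seed
ind   <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train <- boston_scaled[ind, ]

# Four classes give at most three linear discriminants
lda.fit <- lda(crime ~ ., data = train)
lda.fit
```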

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2475248 0.2524752 0.2475248 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       1.05906909 -0.8904696 -0.195131024 -0.8983270  0.4248582
## med_low  -0.08376633 -0.2488762  0.042638951 -0.5857561 -0.1091553
## med_high -0.37626407  0.1687568  0.190859195  0.3701591  0.1672778
## high     -0.48724019  1.0149946  0.003267949  1.0279369 -0.3725562
##                 age        dis        rad        tax     ptratio
## low      -0.9407757  0.9245847 -0.6879988 -0.7181283 -0.44272466
## med_low  -0.3745765  0.3412565 -0.5511961 -0.4646879 -0.03950843
## med_high  0.4854653 -0.3537503 -0.4639351 -0.3527581 -0.21131912
## high      0.8255786 -0.8450012  1.6596029  1.5294129  0.80577843
##               black       lstat         medv
## low       0.3789132 -0.75985909  0.502133500
## med_low   0.3117072 -0.15366948  0.003391685
## med_high  0.1038478  0.04037345  0.170827168
## high     -0.7220979  0.87102411 -0.644094735
## 
## Coefficients of linear discriminants:
##                 LD1         LD2          LD3
## zn       0.08257361  0.63785558 -0.953093815
## indus    0.06544082 -0.15227820  0.322891573
## chas    -0.09270221 -0.05534743  0.156489615
## nox      0.20379978 -0.82698298 -1.324012825
## rm      -0.20607407 -0.19708916 -0.154700829
## age      0.19818808 -0.47509452 -0.241948571
## dis     -0.09623188 -0.33783864  0.002967614
## rad      4.10376979  0.85820016 -0.072029213
## tax      0.02806966  0.13606484  0.516155991
## ptratio  0.13820802 -0.06341854 -0.301033032
## black   -0.12630186  0.04911348  0.117523420
## lstat    0.12744620 -0.30200432  0.242435870
## medv     0.19639703 -0.40340161 -0.214018461
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9594 0.0316 0.0090

To see the full picture of the results, I have plotted the LDA results as a biplot in which different colours mark the different classes of the crime variable. The arrows indicate the impact of each predictor variable in the model. One can clearly see that rad (index of accessibility to radial highways) has the longest arrow and therefore the largest impact; its LD1 coefficient (4.10) is far larger than that of any other predictor.
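
A sketch of how such a biplot can be drawn. The helper `lda.arrows` is reconstructed from the `arrows()` call visible in the warnings near the end of this report, so its remaining arguments (colours, text size) are assumptions, as are the preparation steps and the seed.

```r
library(MASS)
data("Boston")

# Setup (assumed): scaled data, quartile-based crime classes, 80/20 split
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)
ind     <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train   <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Draw one arrow per predictor, from the origin to its LD coefficients
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}

classes <- as.numeric(train$crime)       # class index used as colour and symbol
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
```

A larger `myscale` stretches the arrows when they are hard to read at the scale of the points.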

Now I remove the crime variable from the test data and predict the classes for this new dataset.
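
A sketch of the prediction step. The dimension names `correct` and `predicted` follow the cross-tabulation in the output; everything else (preparation, seed, object names) is an assumption, so the counts will differ from those shown.

```r
library(MASS)
data("Boston")

# Setup (assumed): scaled data, quartile-based crime classes, 80/20 split
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)
ind     <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train   <- boston_scaled[ind, ]
test    <- boston_scaled[-ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Keep the true classes aside, then drop them from the test data
correct_classes <- test$crime
test$crime <- NULL

# Predict and cross-tabulate true against predicted classes
lda.pred <- predict(lda.fit, newdata = test)
tab <- table(correct = correct_classes, predicted = lda.pred$class)
tab
```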

##           predicted
## correct    low med_low med_high high
##   low       12      13        4    0
##   med_low    2      13        6    0
##   med_high   0       9       17    3
##   high       0       0        0   23

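The cross-tabulation shows that all 23 high-crime observations in the test set are classified correctly. The other classes are predicted less reliably: 13 of the 29 low-crime observations are predicted as med_low, and the two medium classes are frequently confused with their neighbouring classes. Overall, 65 of the 102 test observations (about 64%) are classified correctly.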
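
The quality of the prediction can be quantified directly from the cross-tabulation above. The sketch below re-enters the printed counts by hand; the names `cm`, `accuracy` and `recall` are my own.

```r
classes <- c("low", "med_low", "med_high", "high")

# Counts copied from the cross-tabulation (rows: correct, columns: predicted)
cm <- matrix(c(12, 13,  4,  0,
                2, 13,  6,  0,
                0,  9, 17,  3,
                0,  0,  0, 23),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct = classes, predicted = classes))

# Share of all test observations on the diagonal: 65 / 102
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)

# Share of each true class that was predicted correctly
recall <- diag(cm) / rowSums(cm)
round(recall, 3)
```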

Analysing the data from another angle, I will cluster the observations with k-means, which assigns each observation to a cluster based on the distances between observations. The distance between two observations is a measure of their dissimilarity: the smaller the distance, the more similar they are.
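
A sketch of the distance computation and the clustering. It assumes the standardized data and three centers; which input the distance summary in this report was computed on is not shown, and the seed is arbitrary.

```r
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)

# Euclidean distances between all pairs of observations (the default method)
dist_eu <- dist(boston_scaled)
summary(dist_eu)

# k-means with three centers (assumed number of clusters)
set.seed(123)
km <- kmeans(boston_scaled, centers = 3)
table(km$cluster)

# Pairwise scatterplots of the first five variables, coloured by cluster
pairs(boston_scaled[, 1:5], col = km$cluster)
```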

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.620 170.500 226.300 371.900 626.000

The plot shows the scaled variables plotted pairwise against each other.

## [1] 404  13
## [1] 13  3

Here is another 3D plot, in which the color is defined by the k-means clusters. It shows the same points as the LDA plot, but colored by cluster rather than by crime class.
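
The coordinates for such a 3D plot can be obtained by multiplying the 404 x 13 matrix of train predictors with the 13 x 3 matrix of LDA coefficients, which matches the two dimensions printed above. The sketch assumes the same preparation steps as before, an arbitrary seed and an arbitrary number of k-means centers; the plotting call itself is only indicated in a comment.

```r
library(MASS)
data("Boston")

# Setup (assumed): scaled data, quartile-based crime classes, 80/20 split
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)
ind     <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train   <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Project the predictors onto the three linear discriminants
model_predictors <- train[, names(train) != "crime"]
matrix_product   <- as.matrix(model_predictors) %*% lda.fit$scaling
dim(model_predictors)
dim(lda.fit$scaling)

# Cluster the same observations (number of centers is an assumption)
km_train <- kmeans(model_predictors, centers = 3)
coords   <- data.frame(matrix_product, cluster = factor(km_train$cluster))
head(coords)

# With plotly installed, the points could then be drawn with, for example:
# plotly::plot_ly(coords, x = ~LD1, y = ~LD2, z = ~LD3,
#                 type = "scatter3d", mode = "markers", color = ~cluster)
```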

As a bonus, I have performed k-means on the original Boston data and then fitted an LDA with the cluster assignment as the target class.
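
A sketch of this bonus step. The model call `lda(km2$cluster ~ ., data = Boston)` is the one printed in the output; the three centers follow from the three groups shown there, while the seed and `nstart` are assumptions, so cluster numbering and sizes may differ.

```r
library(MASS)
data("Boston")

# k-means on the original (unscaled) data with three centers
set.seed(123)                       # arbitrary seed
km2 <- kmeans(Boston, centers = 3, nstart = 10)
table(km2$cluster)

# LDA with the cluster assignment as target; three groups give two discriminants
lda.fit2 <- lda(km2$cluster ~ ., data = Boston)
lda.fit2
```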

## Call:
## lda(km2$cluster ~ ., data = Boston)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.1996047 0.5237154 0.2766798 
## 
## Group means:
##         crim       zn     indus       chas       nox       rm      age
## 1  0.7491682 10.49505 12.800396 0.05940594 0.5798416 6.189772 73.15743
## 2  0.2323824 17.69811  6.666981 0.07547170 0.4831913 6.468596 55.55623
## 3 12.0799686  0.00000 18.397286 0.06428571 0.6719000 6.004857 89.91143
##        dis       rad      tax  ptratio    black     lstat     medv
## 1 3.394095  4.801980 403.5743 17.73465 369.2717 12.875941 22.20693
## 2 4.867279  4.316981 276.0377 17.84943 388.9088  9.440679 25.97019
## 3 2.054707 22.878571 661.8357 20.12286 286.5699 18.572857 16.26143
## 
## Coefficients of linear discriminants:
##                  LD1           LD2
## crim     0.001231960 -0.0061488634
## zn       0.009459824 -0.0014772000
## indus    0.028954393 -0.0204754827
## chas     0.083882416 -0.1920615150
## nox      1.688279455  3.5675807263
## rm      -0.052904794 -0.0740868296
## age     -0.003403119 -0.0009436915
## dis     -0.133340151 -0.0756860539
## rad      0.128615007 -0.3498011613
## tax      0.021536890  0.0164145869
## ptratio  0.055921351 -0.0639644060
## black   -0.002807248  0.0001461840
## lstat   -0.001948397 -0.0002527576
## medv     0.009225193  0.0149027194
## 
## Proportion of trace:
##   LD1   LD2 
## 0.971 0.029
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped


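The variable nox (nitrogen oxides concentration) seems to be the most influential: it has by far the largest coefficients on both discriminants (1.69 on LD1 and 3.57 on LD2). Because this LDA was fitted on the unscaled data, the size of a coefficient also reflects the units of its variable, and nox has a very narrow range (0.385 to 0.871), so this comparison should be read with caution.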